Abstract
Background
Large language models (LLMs) coupled with retrieval-augmented generation (RAG) can deliver point-of-care, guideline-anchored answers for complex oncologic toxicities. When iteratively refined with domain experts, these systems may rival—or surpass—individual specialist performance.
Objectives
To present and benchmark a purpose-built module of Code Red (https://chatbot.codigorojo.tech/), an educational, medicine-wide generative-AI project, designed specifically to improve the clinical management of toxicities from immune effector cell therapies, particularly chimeric antigen receptor T-cell (CAR-T) therapy (e.g., cytokine release syndrome [CRS] and immune effector cell-associated neurotoxicity syndrome [ICANS]). This module aims to provide near-real-time, reference-backed recommendations that have been iteratively refined with domain experts, addressing current deviations from guideline-based care and the resulting heterogeneity in real-world practice.
Methods
Three simulated CAR-T toxicity cases were constructed to span varying grades and scenarios. Each case was independently answered by three CAR-T–experienced hematologist/oncologists and by Code Red. An external LLM (ChatGPT o3) served as a blinded adjudicator, applying a seven-item rubric: clinical accuracy/guideline concordance (45%), safety and risk mitigation (15%), completeness/contextualization (10%), actionability and clarity (10%), reference quality (10%), transparency about uncertainty (5%), and form/communication efficiency (5%). Each item was scored 0–10, scores were standardized across cases and raters, and the standardized scores were combined into a single weighted composite used to select the "winner" for each case. Code Red uses a RAG pipeline over a curated corpus of CAR-T toxicity guidelines and primary literature, plus rule-based safeguards for dosing and citation integrity.
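For illustration, the weighted composite described above can be reproduced with a short script. The seven rubric items and their weights are taken from the Methods; treating the composite as a plain weighted average of the 0–10 item scores is an assumption for this sketch (the cross-case/rater standardization step is not reproduced, and the example scores shown are hypothetical, not study data).

# Minimal sketch of the seven-item weighted composite (item names and weights from the Methods).
# Assumption: composite = plain weighted average of 0-10 item scores; the standardization
# across cases/raters mentioned in the Methods is not reproduced here.

RUBRIC_WEIGHTS = {
    "clinical_accuracy_guideline_concordance": 0.45,
    "safety_risk_mitigation": 0.15,
    "completeness_contextualization": 0.10,
    "actionability_clarity": 0.10,
    "reference_quality": 0.10,
    "transparency_about_uncertainty": 0.05,
    "form_communication_efficiency": 0.05,
}

def weighted_composite(item_scores: dict[str, float]) -> float:
    """Combine 0-10 item scores into a single 0-10 weighted composite."""
    assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 100%
    return sum(RUBRIC_WEIGHTS[item] * score for item, score in item_scores.items())

# Hypothetical item scores for one response (illustrative only, not the study data):
example_scores = {
    "clinical_accuracy_guideline_concordance": 10,
    "safety_risk_mitigation": 10,
    "completeness_contextualization": 9,
    "actionability_clarity": 9,
    "reference_quality": 8,
    "transparency_about_uncertainty": 6,
    "form_communication_efficiency": 9,
}
print(round(weighted_composite(example_scores), 2))  # 9.35 for these illustrative inputs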
Results
Using the seven-item, 0–10 rubric (weights 45/15/10/10/10/5/5%), the standardized composite scores were: Code Red 8.8, Expert 1 8.0, Expert 2 7.6, and Expert 3 5.3 (all out of 10). Code Red exceeded the top individual expert by 0.8 points (an absolute gain of ~8 percentage points on the 0–10 scale) and the aggregated expert mean (≈7.0/10; median 7.6; range 5.3–8.0) by 1.8 points (an ~26% relative gain).
Across criteria, Code Red scored 10/10 for clinical accuracy/guideline concordance and for safety/risk mitigation, ≥9/10 for completeness, actionability, and form/communication efficiency, and 6/10 (moderate) for transparency about uncertainty. The experts, when averaged, reached 8.0 for accuracy, 8.0 for safety, 7.7 for completeness, 8.3 for actionability, 3.3 for transparency, and 7.0 for form/communication efficiency.
After weighting, Code Red ranked first in every case, leading the blinded adjudicator to select it as the top response throughout. Its margin was driven by a consistently protocol-level presentation—explicit monitoring schedules, predefined intervention thresholds (e.g., tocilizumab at 24 h of persistent fever), steroid regimens, ICU escalation criteria, and tertiary options (anakinra/siltuximab)—delivered in concise, highly actionable language.
Conclusions
A lean, expert-guided RAG system (Code Red) can outperform individual CAR-T specialists on simulated toxicity management scenarios while delivering rapid guidance. Ongoing improvement will rely on continuous user feedback and automated literature surveillance to preserve patient safety and guideline fidelity.